-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve non-missingness during non-inner joins #1316
Conversation
Travis Mac CI stalled during Julia image download. other tests ok |
Last time we considered this we voted against it because we were still using NullableArrays and we wanted to consistently interact with the columns using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I agree about the "type-stable" part, but I'm not a fan of changing the returned type depending on the data. Apart from making the behavior hard to predict, it won't work with a fully type-stable variant of DataFrame
that I've considered supporting as an alternative. Do you really think it's common to make a left/right join without any missing values? The point of these operations is that you expect some rows to be missing from the second table, or you can use an inner join instead.
src/abstractdataframe/join.jl
Outdated
_similar = kind == :inner ? similar : similar_missing | ||
# inner and left joins preserve non-missingness of the left frame | ||
# it is also preserved if all right rows have left matches | ||
_similar_left = kind == :inner || kind == :left || length(rightonly_ixs.join) == 0 ? similar : similar_missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the condition is very long, better use if ...; _similar_left = ...
src/abstractdataframe/join.jl
Outdated
on_col_ix = findfirst(joiner.left_on, names(joiner.dfl)[i]) | ||
if on_col_ix > 0 && kind == :right | ||
# if right join, construct the on-column | ||
# using the right frame to preserve missingness and cat.levels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's "cat.levels"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
categorical levels :) I'll expand it
src/abstractdataframe/join.jl
Outdated
rcol = joiner.dfr_on[on_col_ix] | ||
cols[i] = similar(rcol, nrow) | ||
copy!(cols[i], view(rcol, all_orig_right_ixs)) | ||
permute!(cols[i], right_perm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there an alternative solution which doesn't involve calling permute!
? I'm concerned about the performance impact compared with the previous code which simply called copy!
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the case only for the right joins. Previously it was handled later in the code (if length(rightonly_ixs) > 0
block).
Actually, now it should be faster, because we compose the resulting column using the right frame directly, instead of initializing it using the left frame and fixing it later.
The problem of the inner join is that it "swallows" the rows that do not have matches, and checking for that is more expensive (performance- and code size-wise) than the introduction of missing values and checking the resulting column type.
|
It's not an issue if we use an
Given that we have |
I'm not sure I follow. By "internally" you mean within
See the usecase above, the keyword is equivalent to applying
In SQL you have "INSERT INTO tbl SELECT ...", which checks that the result of SELECT fits the constraints of "tbl". EDIT: I was wrong that "strict" is just a specific type of "inner" join. There could also be "strict left" (all left rows should have matching right rows) and "strict right" joins. |
I've updated the commit removing the data-dependent rules. |
CI passes on 0.6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Indeed, better merge the uncontroversial parts first an discuss more problematic ones later.
src/abstractdataframe/join.jl
Outdated
for (on_col_ix, on_col) in enumerate(joiner.left_on) | ||
# fix the result of the rightjoin by taking the nonmissing values from the right table | ||
offset = nrow - length(rightonly_ixs.orig) + 1 | ||
copy!(res[on_col], offset, view(joiner.dfr_on[on_col_ix], rightonly_ixs.orig)) | ||
end | ||
end | ||
if kind == :outer && !isempty(rightonly_ixs.join) | ||
# some non-missing on-columns may have become missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"missing column" can be confusing as it's not clear whether it's the column which might be missing. Better say something like "columns allowing for missing" even if it's a bit long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix. In the long run would be nice to have a term for it (even the ugly internal one, which would be used in cases like this).
src/abstractdataframe/join.jl
Outdated
for (on_col_ix, on_col) in enumerate(joiner.left_on) | ||
LT = eltype(joiner.dfl_on[on_col_ix]) | ||
RT = eltype(joiner.dfr_on[on_col_ix]) | ||
if Missings.T(LT) === LT && Missings.T(RT) === RT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this just be !(LT >: Missing || RT >: Missing)
? That's more explicit IMHO. It could also be worth adding a special case for Any
columns, for which there's no point in calling disallowmissing
since it won't have any effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way because Any >: Missing
. But you're right, in this context we don't need to fix Any
columns. Will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But Missings.T(Any) === Any
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, so the code would have called disallowmissing
, which is wrong.
Since !(Any >: Missing || T >: Missing) === false
, it would not be called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a testset for joins with Any eltypes.
test/join.jl
Outdated
@@ -346,8 +343,8 @@ module TestJoin | |||
@test levels(join(B, A, on=:b, kind=:inner)[:b]) == ["a", "b", "c"] | |||
@test levels(join(A, B, on=:b, kind=:left)[:b]) == ["d", "c", "b", "a"] | |||
@test levels(join(B, A, on=:b, kind=:left)[:b]) == ["a", "b", "c"] | |||
@test levels(join(A, B, on=:b, kind=:right)[:b]) == ["d", "c", "b", "a"] | |||
@test levels(join(B, A, on=:b, kind=:right)[:b]) == ["a", "b", "d", "c"] | |||
@test levels(join(A, B, on=:b, kind=:right)[:b]) == ["a", "b", "c"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this change? The docstring for join
gives guarantees about the levels, so either it wasn't correct before or it might need to be updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this PR the types and categorical levels of the right on-columns have priority over the left ones.
I think it's a logical thing. Also it's easier to implement.
So I can either patch the PR to the previous behaviour or update the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as it's not too hard to implement, I'd choose the rule which is the simplest to explain in the docstring, which is a good indication that it's easy to understand. IIUC, the new behavior you propose is to give priority to the left column, except for kind=:right
? I'm not too fan of exceptions, even if I see the justification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reverted to the old category levels ordering.
- left and right joins preserve non-missingness of the left and right frame columns, respectively - non-missingness of the on-columns is preserved (if non-missing in both frames)
squashed, in case you want to merge it preserving the commits |
Currently, if you do a few joins, most of the columns in your data frame become "missable" (
>: Missing
).That's because so far
join
preserves non-missingness only for inner joins:This PR adds the other logical rules:
The rules above are "type stable", so to say. I.e. they don't depend on the contents of the frames.
But I thought it also makes sense to introduce data-dependent rules (there's no performance penalty in checking them):
As a reasonable side effect of the rules implementation, when doing the right join, the eltype (and levels, for categorical arrays) of the right on-column has priority over the left on-column.